Goto

Collaborating Authors

 parallel gpu task scheduling


Nimble: Lightweight and Parallel GPU Task Scheduling for Deep Learning

Neural Information Processing Systems

Deep learning (DL) frameworks take advantage of GPUs to improve the speed of DL inference and training. Ideally, DL frameworks should be able to fully utilize the computation power of GPUs such that the running time depends on the amount of computation assigned to GPUs. Yet, we observe that in scheduling GPU tasks, existing DL frameworks suffer from inefficiencies such as large scheduling overhead and unnecessary serial execution. To this end, we propose Nimble, a DL execution engine that runs GPU tasks in parallel with minimal scheduling overhead. Nimble introduces a novel technique called ahead-of-time (AoT) scheduling.


Review for NeurIPS paper: Nimble: Lightweight and Parallel GPU Task Scheduling for Deep Learning

Neural Information Processing Systems

Weaknesses: This work is most applicable on networks with many small kernels, which may not be of broad interest in all cases. Nonetheless, it does help with training MobileNet and similar networks on desktop or server GPUs. I also feel that some parts of the paper overstate the contribution, either by only evaluating on these networks or by leaving out some optimized baselines. The biggest issues here are: - For inference, you should compare against an optimized inference runtime such as TensorRT. This will likely do better than PyTorch or Caffe2 do out of the box, even with TorchScript.


Nimble: Lightweight and Parallel GPU Task Scheduling for Deep Learning

Neural Information Processing Systems

Deep learning (DL) frameworks take advantage of GPUs to improve the speed of DL inference and training. Ideally, DL frameworks should be able to fully utilize the computation power of GPUs such that the running time depends on the amount of computation assigned to GPUs. Yet, we observe that in scheduling GPU tasks, existing DL frameworks suffer from inefficiencies such as large scheduling overhead and unnecessary serial execution. To this end, we propose Nimble, a DL execution engine that runs GPU tasks in parallel with minimal scheduling overhead. Nimble introduces a novel technique called ahead-of-time (AoT) scheduling. Evaluation on a variety of neural networks shows that compared to PyTorch, Nimble speeds up inference and training by up to 22.34 and 3.61, respectively.